How Many Bootstrap Replicates Are Necessary?
نویسندگان
چکیده
Phylogenetic bootstrapping (BS) is a standard technique for inferring confidence values on phylogenetic trees that is based on reconstructing many trees from minor variations of the input data, trees called replicates. BS is used with all phylogenetic reconstruction approaches, but we focus here on one of the most popular, maximum likelihood (ML). Because ML inference is so computationally demanding, it has proved too expensive to date to assess the impact of the number of replicates used in BS on the relative accuracy of the support values. For the same reason, a rather small number (typically 100) of BS replicates are computed in real-world studies. Stamatakis et al. recently introduced a BS algorithm that is 1 to 2 orders of magnitude faster than previous techniques, while yielding qualitatively comparable support values, making an experimental study possible. In this article, we propose stopping criteria--that is, thresholds computed at runtime to determine when enough replicates have been generated--and we report on the first large-scale experimental study to assess the effect of the number of replicates on the quality of support values, including the performance of our proposed criteria. We run our tests on 17 diverse real-world DNA--single-gene as well as multi-gene--datasets, which include 125-2,554 taxa. We find that our stopping criteria typically stop computations after 100-500 replicates (although the most conservative criterion may continue for several thousand replicates) while producing support values that correlate at better than 99.5% with the reference values on the best ML trees. Significantly, we also find that the stopping criteria can recommend very different numbers of replicates for different datasets of comparable sizes. Our results are thus twofold: (i) they give the first experimental assessment of the effect of the number of BS replicates on the quality of support values returned through BS, and (ii) they validate our proposals for stopping criteria. Practitioners will no longer have to enter a guess nor worry about the quality of support values; moreover, with most counts of replicates in the 100-500 range, robust BS under ML inference becomes computationally practical for most datasets. The complete test suite is available at http://lcbb.epfl.ch/BS.tar.bz2, and BS with our stopping criteria is included in the latest release of RAxML v7.2.5, available at http://wwwkramer.in.tum.de/exelixis/software.html.
منابع مشابه
Confidence intervals for random forests: the jackknife and the infinitesimal jackknife
We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and worki...
متن کاملConfidence Intervals for Random Forests Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife
We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2012) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and worki...
متن کاملEffects of parameter estimation on maximum-likelihood bootstrap analysis.
Bipartition support in maximum-likelihood (ML) analysis is most commonly assessed using the nonparametric bootstrap. Although bootstrap replicates should theoretically be analyzed in the same manner as the original data, model selection is almost never conducted for bootstrap replicates, substitution-model parameters are often fixed to their maximum-likelihood estimates (MLEs) for the empirical...
متن کاملStatistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock.
The statistical properties of sample estimation and bootstrap estimation of phylogenetic variability from a sample of nucleotide sequences are studied by using model trees of three taxa with an outgroup and by assuming a constant rate of nucleotide substitution. The maximum-parsimony method of tree reconstruction is used. An analytic formula is derived for estimating the sequence length that is...
متن کاملDetermining production level under uncertainty using fuzzy simulation and bootstrap technique, a case study
In every production plant, it is necessary to have an estimation of production level. Sometimes there are many parameters affective in this estimation. In this paper, it tried to find an appropriate estimation of production level for an industrial factory called Barez in an uncertain environment. We have considered a part of production line, which has different production time for different kin...
متن کاملThe Local Bootstrap for Markov processes
A nonparametric bootstrap procedure is proposed for stochastic processes which follow a general autoregressive structure. The procedure generates bootstrap replicates by locally resampling the original set of observations reproducing automatically its dependence properties. It avoids an initial nonparametric estimation of process characteristics in order to generate the pseudo-time series and t...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Journal of computational biology : a journal of computational molecular cell biology
دوره 17 3 شماره
صفحات -
تاریخ انتشار 2009